Massive Data Set

Management and Analysis in the Context of Global Change

Carl Boettiger, UC Santa Cruz

20 May 2014

Global Change

Overpeck+ (2011) doi:[10.1126/science.1197869](http://doi.org/10.1126/science.1197869)

Fishery Collapse?

Worm+ (2006) doi:[10.1126/science.1132294](http://doi.org/10.1126/science.1132294)

Tipping points?

Barnoksy+ (2012) doi:[10.1038/nature11018](http://doi.org/10.1038/nature11018)

Me

Early warning signs?

Ecological management under uncertainty:

credit: NOAA

Decision Theory

Remote sensors

credit: NASA

micro sensors

credit: NSF

NEON

credit: Hopkin (2006) doi:[10.1038/444420a](http://doi.org/10.1038/444420a)

OOI

credit: Witze (2013) doi:[10.1038/501480a](http://doi.org/10.1038/501480a)

Computer simulations

credit: NERSC

Field-based study

credit: Scambos & Bauer, NSIDC

Growth of climate data by type

Overpeck+ (2011) doi:[10.1126/science.1197869](http://doi.org/10.1126/science.1197869)

What is “a lot” of data?

100 MB?

100 GB?

100 PB?

Big enough to be a problem

Engineering bottlenecks

Baraniuk (2011) doi:[10.1126/science.1197448](http://doi.org/10.1126/science.1197448)

Science bottlenecks

adapted from Reichman+ (2011) doi:[10.1126/science.1197962](http://doi.org/10.1126/science.1197962)

Different ways to be big

Volume

Variety

Velocity

data visualization

today the visualization and analysis component has become a bottleneck

Fox & Hendler (2011) doi:[10.1126/science.1197654](http://doi.org/10.1126/science.1197654)

data visualization

Something is wrong with this picture

Fox & Hendler (2011) doi:[10.1126/science.1197654](http://doi.org/10.1126/science.1197654)

Most scientific data is created in a form that facilitates its generation rather than focusing on its eventual use.

Fox & Hendler (2011) doi:[10.1126/science.1197654](http://doi.org/10.1126/science.1197654)

Variety

Factory Farm Data…

credit: Arthus-Bertrand

… Organic, hand-crafted variety

credit: Arthus-Bertrand

Vertically integrated data repositories

The rOpenSci project

building tools, building community

World Bank Climate Data (Example)

IPCC records and model projections at your fingertips

library("rWBclimate)
country.list <- c("USA", "CAN)
country.dat <- get_historical_temp(country.list, "year)

ggplot(country.dat, aes(x = year, y = data, group = locator)) +
  geom_point() + geom_path() + xlab("Year) +
  ylab("Average annual temperature) +
  stat_smooth(se = F, colour = "black) +
  facet_wrap(~locator, scale = "free) + theme_bw()

World Bank Climate Data (Example)

IPCC records and model projections at your fingertips

UN FAO Fisheries data (Example)

library("rfisheries)
species <- of_species_codes()
who <- c("TUX", "COD", "VET", "NPA)
by_species <- lapply(who, function(x) of_landings(species = x))

names(by_species) <- who
dat <- melt(by_species, id = c("catch", "year))[, -5]
names(dat) <- c("catch", "year", "species", "a3_code)
ggplot(dat, aes(year, catch)) + geom_line() +
  facet_wrap(~a3_code, scales = "free_y) + theme_bw()

UN FAO Fisheries data (Example)

GBIF

  • library("rgbif)

Limitations to vertical integration

Metadata-driven repositories

Metadata Standards

Jones+ (2006) doi:[10.1146/annurev.ecolsys.37.091305.110031](http://doi.org/10.1146/annurev.ecolsys.37.091305.110031)

KNB

Quality vs Quantity

  • There are no good data, bad data
  • Some are just missing metadata

Communicating data limitations

effective interdisciplinary communication of data limitations with regard to, for example,

  • spatial and temporal sampling uncertainties;
  • instrument changes;
  • quality-control procedures; and, in particular,
  • what model-based climate predictions or projections do well and not so well.

Overpeck+ (2011) doi:[10.1126/science.1197869](http://doi.org/10.1126/science.1197869)

EML

  • Easy to use existing metadata

  • Easy to generate metadata

  • Easy to enhance & improve data with further annotation

Compare this to the standard publication…

How a new student sees a paper

How a leading professor sees it

How a computer sees it

Metaknowledge

Evans+ doi:[10.1126/science.1201765](http://doi.org/10.1126/science.1201765)

Formal semantics

Mitchner (2012) doi:[10.1016/j.tree.2011.11.016](http://doi.org/10.1016/j.tree.2011.11.016)

Data sharing

Although research scientists have been the main users of these data, an increasing number of resource managers (working in fields such as water, public lands, health, and marine resources) need and are seeking access to climate data to inform their decisions, just as a growing range of policy-makers rely on climate data to develop climate change strategies

- Overpeck+ (2011) doi:[10.1126/science.1197869](http://doi.org/10.1126/science.1197869)

Data Sharing?

  • Whitehouse mandates
  • Journal mandates
  • Funder mandates

Velocity

Speed of algorithms

credit: The Economist

Data synthesis is another bottleneck

  • Is today’s research based on yesterday’s data?
  • The exponential growth of data suggests that most of the data yet to come
  • Revisit and design today’s analyses to account for tomorrow’s data

Example: Synthetic analysis of biodiversity loss

Synthesizes over 140 data sets.

Finds no evidence for systematic loss

How easy would it be to update this to reflect new data?

From bottlenecks to workflows

adapted from Reichman+ (2011) doi:[10.1126/science.1197962](http://doi.org/10.1126/science.1197962)

From bottlenecks to workflows

Reichman+ (2011) doi:[10.1126/science.1197962](http://doi.org/10.1126/science.1197962)

Example workflows: Dynamic documents

Peng (2011) doi:[10.1126/science.1213847](http://doi.org/10.1126/science.1213847)

IPython

Mascarelli (2014) doi:[10.1038/nj7493-523a](http://doi.org/10.1038/nj7493-523a)

Reproducible research

credit: The Economist

credit: The Economist

Sharing code?

Velocity of data vs velocity of policy?

credit: Wikipedia

Conclusions

Global change problems are increasingly data driven, bringing new challenges and opportunities:

  • Volume: We often hit science bottlenecks before hardware bottlenecks.
  • Variety: We need metadata driven repositories for diverse data.
  • Velocity: Workflows for the data of tomorrow

Thank you

  • NSF
  • Sloan Foundation
  • NOAA / CSTAR
  • Karthik Ram
  • Scott Chamberlain
  • Ted Hart
  • Matt Jones
  • Marc Mangel